RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning

Khalil, Khurram, Khaliq, Muhammad Mahad, Hoque, Khaza Anuarul

arXiv.org Artificial Intelligence

Abstract--The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search-space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8× improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows. The recent advent of Large Language Models (LLMs) with hundreds of billions of parameters has had a transformative impact on computing, but has also introduced unprecedented computational demands [1].
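The abstract's core idea, casting worst-case fault search as a sequential decision-making problem, can be illustrated with a minimal sketch. This is not RIFT itself: it substitutes a simple epsilon-greedy bandit for the paper's full RL formulation, the warm-up pass loosely mirrors the sensitivity-analysis pruning step, and `assess_impact` is a hypothetical stand-in for an actual fault-injection run.

```python
import random

def assess_impact(fault_site):
    """Hypothetical stand-in for a real fault-injection run; returns an
    output-corruption score for injecting a fault at this site.
    Toy ground truth: sites near index 7 are the most critical."""
    return max(0.0, 1.0 - 0.2 * abs(fault_site - 7))

def rl_fault_search(num_sites=20, episodes=200, epsilon=0.2, seed=0):
    """Epsilon-greedy sketch of RL-guided fault targeting: learn per-site
    impact estimates, then return a minimal high-impact test suite."""
    rng = random.Random(seed)
    q = [0.0] * num_sites   # estimated impact per fault site
    n = [0] * num_sites     # injection counts per site
    for site in range(num_sites):      # one calibration injection per site
        n[site] = 1                    # (mirrors sensitivity-based pruning)
        q[site] = assess_impact(site)
    for _ in range(episodes):
        if rng.random() < epsilon:
            site = rng.randrange(num_sites)                    # explore
        else:
            site = max(range(num_sites), key=lambda s: q[s])   # exploit
        reward = assess_impact(site)
        n[site] += 1
        q[site] += (reward - q[site]) / n[site]   # incremental mean update
    # Minimal test suite: the top-ranked fault sites
    return sorted(range(num_sites), key=lambda s: -q[s])[:3]
```

In this toy setting the search concentrates injections on the few sites that actually corrupt the output, which is the intuition behind reducing test vector volume while keeping coverage high.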


High-Dimensional Fault Tolerance Testing of Highly Automated Vehicles Based on Low-Rank Models

Mei, Yuewen, Nie, Tong, Sun, Jian, Tian, Ye

arXiv.org Artificial Intelligence

Ensuring fault tolerance of Highly Automated Vehicles (HAVs) is crucial for their safety due to the presence of potentially severe faults. Hence, Fault Injection (FI) testing is conducted by practitioners to evaluate the safety level of HAVs. To fully cover test cases, various driving scenarios and fault settings should be considered. However, due to the numerous combinations of test scenarios and fault settings, the testing space can be complex and high-dimensional. In addition, evaluating performance in all newly added scenarios is resource-consuming. The rarity of critical faults that can cause security problems further compounds the challenge. To address these challenges, we propose to accelerate FI testing under the low-rank Smoothness Regularized Matrix Factorization (SRMF) framework. We first organize the sparse evaluated data into a structured matrix based on its safety values. Then the untested values are estimated from the correlations captured by the matrix structure. To address high dimensionality, a low-rank constraint is imposed on the testing space. To exploit the relationships between existing scenarios and new scenarios and capture the local regularity of critical faults, three types of smoothness regularization are further designed as a complement. We conduct experiments on car-following and cut-in scenarios. The results indicate that SRMF has the lowest prediction error in various scenarios and is capable of predicting rare critical faults compared to other machine learning models. In addition, SRMF achieves a 1171× acceleration rate, 99.3% precision, and a 91.1% F1 score in identifying critical faults. To the best of our knowledge, this is the first work to introduce low-rank models to FI testing of HAVs.
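The low-rank completion idea behind SRMF can be sketched in a few lines: arrange safety evaluations in a matrix, leave untested (scenario, fault-setting) cells empty, and estimate them from the matrix's low-rank structure. The snippet below is an illustrative rank-1 alternating-least-squares completion on toy data, not the paper's SRMF; in particular it omits the three smoothness regularization terms.

```python
import numpy as np

def rank1_complete(M, mask, iters=100):
    """Alternating least squares for rank-1 matrix completion:
    fit M ~= outer(u, v) using only the observed entries (mask == 1)."""
    m, n = M.shape
    u, v = np.ones(m), np.ones(n)
    for _ in range(iters):
        for i in range(m):            # solve u[i] on the observed row entries
            w = mask[i]
            u[i] = (w * v * M[i]).sum() / ((w * v * v).sum() + 1e-12)
        for j in range(n):            # solve v[j] on the observed column entries
            w = mask[:, j]
            v[j] = (w * u * M[:, j]).sum() / ((w * u * u).sum() + 1e-12)
    return np.outer(u, v)

# Toy "safety value" matrix: rows = fault settings, columns = scenarios
truth = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
mask = np.ones_like(truth)
mask[2, 2] = 0.0                      # this combination was never tested
completed = rank1_complete(truth * mask, mask)
```

Because the observed entries are consistent with a rank-1 structure, the untested cell is recovered from the row/column correlations alone, which is exactly the mechanism SRMF exploits at scale.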


Testing Spintronics Implemented Monte Carlo Dropout-Based Bayesian Neural Networks

Ahmed, Soyed Tuhin, Hefenbrock, Michael, Prenat, Guillaume, Anghel, Lorena, Tahoori, Mehdi B.

arXiv.org Artificial Intelligence

Bayesian Neural Networks (BayNNs) can inherently estimate predictive uncertainty, facilitating informed decision-making. Dropout-based BayNNs are increasingly implemented in spintronics-based computation-in-memory architectures for resource-constrained yet high-performance safety-critical applications. Although uncertainty estimation is important, the reliability of Dropout generation and BayNN computation is equally important for target applications but is overlooked in existing works. However, testing BayNNs is significantly more challenging compared to conventional NNs, due to their stochastic nature. In this paper, we present for the first time a model of the non-idealities of the spintronics-based Dropout module and analyze their impact on uncertainty estimates and accuracy. Furthermore, we propose a testing framework based on repeatability ranking for Dropout-based BayNNs with up to 100% fault coverage while using only 0.2% of training data as test vectors. Bayesian Neural Networks (BayNNs) offer substantial benefits over conventional neural networks (NNs), particularly in safety-critical applications where reliability and confidence in prediction are paramount [1]. Unlike traditional NNs, BayNNs can inherently capture and estimate the uncertainty of their predictions, enhancing decision-making under uncertain conditions. However, their implementation faces significant computational bottlenecks, especially on edge devices. Spintronics-based computation-in-memory (Spintronics-CIM) architectures are a promising solution for the hardware realization of BayNNs as they mitigate some of the inherent computational costs, balancing high-performance demands with the constraints of resource-limited devices.
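The repeatability-ranking idea can be illustrated with a toy model: run a dropout network many times on each candidate input and rank inputs by how stable their predicted class is. Highly repeatable inputs make good test vectors, because for them an unstable output points to a hardware fault rather than ordinary dropout noise. This is a minimal sketch with a made-up linear model, not the paper's spintronics framework; all names and values here are illustrative.

```python
import numpy as np

def dropout_forward(x, W, p_drop, rng):
    """One stochastic forward pass: randomly drop input features, then project."""
    keep = rng.random(x.shape) >= p_drop
    return (x * keep) @ W

def repeatability_rank(X, W, p_drop=0.3, runs=50, seed=0):
    """Score each candidate test vector by how often repeated stochastic runs
    agree on the predicted class; return indices, most repeatable first."""
    rng = np.random.default_rng(seed)
    scores = []
    for x in X:
        preds = [int(np.argmax(dropout_forward(x, W, p_drop, rng)))
                 for _ in range(runs)]
        majority = max(set(preds), key=preds.count)
        scores.append(preds.count(majority) / runs)   # agreement ratio
    return np.argsort(scores)[::-1]

W = np.eye(2)
X = np.array([[10.0, 0.00],    # decisive input: class 0 under any dropout pattern
              [1.0,  1.01]])   # borderline input: class flips with dropout
order = repeatability_rank(X, W)
```

The decisive input scores a perfect agreement ratio and is selected first, matching the intuition that a tiny, carefully ranked subset of inputs can serve as a high-coverage test set.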


Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction

Chen, Zitao, Li, Guanpeng, Pattabiraman, Karthik

arXiv.org Machine Learning

With the emerging adoption of deep neural networks (DNNs) in the HPC domain, the reliability of DNNs is also growing in importance. As prior studies demonstrate the vulnerability of DNNs to hardware transient faults (i.e., soft errors), there is a compelling need for an efficient technique to protect DNNs from soft errors. While the inherent resilience of DNNs can tolerate some transient faults (which would not affect the system's output), prior work has found there are critical faults that cause safety violations (e.g., misclassification). In this work, we exploit the inherent resilience of DNNs to protect them from critical faults. In particular, we propose Ranger, an automated technique to selectively restrict the ranges of values in particular DNN layers, which dampens the large deviations typically caused by critical faults into smaller ones. Such reduced deviations can usually be tolerated by the inherent resilience of DNNs. Ranger can be integrated into existing DNNs without retraining, and with minimal effort. Our evaluation on 8 DNNs (including two used in self-driving car applications) demonstrates that Ranger achieves a significant resilience boost without degrading the model's accuracy, while incurring negligible overhead.
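The range-restriction mechanism can be sketched directly: profile fault-free activation ranges per layer, then clip at inference time so that a bit flip producing a huge activation is pulled back into the tolerated range. The function and variable names below are illustrative, not from the paper's artifact.

```python
import numpy as np

def profile_range(clean_activations):
    """Record the min/max activation values seen during fault-free profiling."""
    return float(clean_activations.min()), float(clean_activations.max())

def ranger_clip(activations, value_range):
    """Restrict activations to the profiled range, dampening the large
    deviations that critical transient faults typically produce."""
    lo, hi = value_range
    return np.clip(activations, lo, hi)

# Fault-free profiling pass for one layer
clean = np.array([0.1, 0.5, 0.9, 0.3])
bounds = profile_range(clean)

# A soft error flips a high-order exponent bit in one activation
faulty = np.array([0.1, 1e30, 0.9, 0.3])
restored = ranger_clip(faulty, bounds)
```

The corrupted value is clamped to the profiled maximum while unaffected activations pass through unchanged, so the residual deviation is small enough for the network's inherent resilience to absorb — and no retraining is needed.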